Mathematical & Statistical Methods


Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification - supplementary material
Francesca Mignacco

Neural Information Processing Systems

The derivation of the self-consistent stochastic process discussed in the main text can be obtained using tools from the statistical physics of disordered systems. In particular, it has been carried out very recently for a related model, the spherical perceptron with random labels, in [1]. Our derivation extends the known DMFT equations by including structure in the data; a stochastic version of gradient descent, as discussed in the main text; the relaxation of the spherical constraint on the weights; and the introduction of a ridge regularization term. There are at least two ways to write the DMFT equations: one uses field-theoretic techniques, while the other employs a dynamical version of the so-called cavity method [2].
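To make the setting concrete, here is a minimal, illustrative sketch (not the authors' code) of the kind of learning problem whose high-dimensional dynamics the DMFT tracks: online SGD for a single-layer classifier on a two-cluster Gaussian mixture, with a ridge penalty in place of a hard spherical constraint. The loss, step size, and mixture parameters below are arbitrary assumptions.

```python
import numpy as np

# Illustrative sketch only: online SGD on a Gaussian-mixture classification task
# with a ridge (L2) penalty; loss and hyperparameters are arbitrary choices.
rng = np.random.default_rng(0)
d, n, lr, lam = 200, 5000, 0.05, 0.01          # dimension, samples, step size, ridge strength
mu = rng.normal(size=d) / np.sqrt(d)            # cluster mean; data drawn from the +/- mu mixture

labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * mu[None, :] + rng.normal(size=(n, d)) / np.sqrt(d)

w = rng.normal(size=d) / np.sqrt(d)             # weights, no hard spherical constraint
for x, y in zip(X, labels):                     # one pass of stochastic gradient descent
    margin = y * (x @ w)
    grad = -y * x * (margin < 1.0) + lam * w    # hinge-type loss gradient plus ridge term
    w -= lr * grad

print("training accuracy:", np.mean(np.sign(X @ w) == labels))
```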


Conditional Distribution Compression via the Kernel Conditional Mean Embedding

arXiv.org Machine Learning

Existing distribution compression methods, like Kernel Herding (KH), were originally developed for unlabelled data. However, no existing approach directly compresses the conditional distribution of labelled data. To address this gap, we first introduce the Average Maximum Conditional Mean Discrepancy (AMCMD), a natural metric for comparing conditional distributions. We then derive a consistent estimator for the AMCMD and establish its rate of convergence. Next, we make a key observation: in the context of distribution compression, the cost of constructing a compressed set targeting the AMCMD can be reduced from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$. Building on this, we extend the idea of KH to develop Average Conditional Kernel Herding (ACKH), a linear-time greedy algorithm that constructs a compressed set targeting the AMCMD. To better understand the advantages of directly compressing the conditional distribution rather than doing so via the joint distribution, we introduce Joint Kernel Herding (JKH), a straightforward adaptation of KH designed to compress the joint distribution of labelled data. While herding methods provide a simple and interpretable selection process, they rely on a greedy heuristic. To explore alternative optimisation strategies, we propose Joint Kernel Inducing Points (JKIP) and Average Conditional Kernel Inducing Points (ACKIP), which jointly optimise the compressed set while maintaining linear complexity. Experiments show that directly preserving conditional distributions with ACKIP outperforms both joint distribution compression (via JKH and JKIP) and the greedy selection used in ACKH. Moreover, we see that JKIP consistently outperforms JKH.
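For orientation, the sketch below implements classical Kernel Herding on unlabelled data, the baseline the paper builds on; the AMCMD objective and the conditional variants (ACKH, ACKIP, JKH, JKIP) are not reproduced here, and the RBF kernel and bandwidth are arbitrary choices.

```python
import numpy as np

# Minimal sketch of classical Kernel Herding (the unlabelled baseline); the
# conditional variants and the AMCMD target from the paper are not shown.
def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_herding(X, m, gamma=1.0):
    """Greedily pick m points whose empirical kernel mean tracks that of X."""
    K = rbf(X, X, gamma)
    mean_embed = K.mean(axis=1)              # <mu_P, k(x_i, .)> for each candidate
    chosen, sum_k = [], np.zeros(len(X))
    for t in range(m):
        # herding score: match the mean embedding, repel already chosen points
        scores = mean_embed - sum_k / (t + 1)
        idx = int(np.argmax(scores))
        chosen.append(idx)
        sum_k += K[:, idx]
    return X[chosen]

X = np.random.default_rng(0).normal(size=(500, 2))
coreset = kernel_herding(X, m=20)
print(coreset.shape)                          # (20, 2)
```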


Riemannian Optimization on Relaxed Indicator Matrix Manifold

arXiv.org Machine Learning

The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several Retraction methods, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold generalizes the doubly stochastic manifold: existing methods on the doubly stochastic manifold have complexity \( \mathcal{O}(n^3) \), whereas optimization on the RIM manifold is \( \mathcal{O}(n) \) and often yields better results. We conducted extensive experiments, including image denoising with millions of variables, to support our conclusions. We also applied the RIM manifold to Ratio Cut, provided a rigorous convergence proof, and achieved clustering results that outperform state-of-the-art methods. Our code is available \href{https://github.com/Yuan-Jinghui/Riemannian-Optimization-on-Relaxed-Indicator-Matrix-Manifold}{here}.
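As a rough illustration of the kind of step such a toolbox enables (and only that), the sketch below takes a Euclidean gradient step and maps each row back onto the probability simplex. This row-wise projection is an assumed stand-in for intuition; it is not the paper's fast geodesic Retraction, nor its exact definition of the RIM manifold.

```python
import numpy as np

# Illustrative sketch: projected-gradient step with a row-wise simplex projection,
# playing the role a Retraction plays on the RIM manifold (not the paper's method).
def project_rows_to_simplex(M):
    """Euclidean projection of each row of M onto {p >= 0, sum(p) = 1}."""
    n, k = M.shape
    U = -np.sort(-M, axis=1)                       # sort each row in descending order
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    cond = U - css / idx > 0
    rho = cond.sum(axis=1)                         # last index where the condition holds
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(M - theta[:, None], 0.0)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 3))                        # ambient-space iterate
grad = rng.normal(size=(6, 3))                     # some Euclidean gradient
H_next = project_rows_to_simplex(H - 0.1 * grad)   # "retract" after the step
print(H_next.sum(axis=1))                          # each row sums to 1
```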


Local Distance-Preserving Node Embeddings and Their Performance on Random Graphs

arXiv.org Machine Learning

Learning node representations is a fundamental problem in graph machine learning. While existing embedding methods effectively preserve local similarity measures, they often fail to capture global functions like graph distances. Inspired by Bourgain's seminal work on Hilbert space embeddings of metric spaces (1985), we study the performance of local distance-preserving node embeddings. Known as landmark-based algorithms, these embeddings approximate pairwise distances by computing shortest paths from a small subset of reference nodes (i.e., landmarks). Our main theoretical contribution shows that random graphs, such as Erd\H{o}s-R\'enyi random graphs, require lower dimensions in landmark-based embeddings compared to worst-case graphs. Empirically, we demonstrate that the GNN-based approximations for the distances to landmarks generalize well to larger networks, offering a scalable alternative for graph representation learning.
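A minimal sketch of a landmark-based embedding on an Erdős–Rényi graph is given below; the landmark count, graph size, and edge probability are arbitrary choices, and the GNN-based approximation studied in the paper is not included.

```python
import networkx as nx
import numpy as np

# Minimal sketch of a landmark-based embedding on an Erdos-Renyi graph
# (illustrative; all parameters are arbitrary).
rng = np.random.default_rng(0)
G = nx.erdos_renyi_graph(n=300, p=0.03, seed=0)
landmarks = rng.choice(G.number_of_nodes(), size=8, replace=False)

# Each node's embedding = its shortest-path distances to the landmarks.
dist = {int(l): nx.single_source_shortest_path_length(G, int(l)) for l in landmarks}
emb = np.array([[dist[int(l)].get(v, np.inf) for l in landmarks] for v in G.nodes()])

def approx_distance(u, v):
    """Upper bound d(u, v) <= min_l [d(u, l) + d(l, v)] via the triangle inequality."""
    return np.min(emb[u] + emb[v])

u, v = 5, 200
print(approx_distance(u, v), nx.shortest_path_length(G, u, v))
```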


The Ground Cost for Optimal Transport of Angular Velocity

arXiv.org Machine Learning

We revisit the optimal transport problem over angular velocity dynamics given by the controlled Euler equation. The solution of this problem enables stochastic guidance of the spin states of a rigid body (e.g., a spacecraft) under a hard deadline constraint by transferring a given initial state statistics to a desired terminal state statistics. This is an instance of generalized optimal transport over a nonlinear dynamical system. While prior work has reported existence-uniqueness and numerical solution of this dynamical optimal transport problem, here we present structural results about the equivalent Kantorovich, i.e., optimal coupling, formulation. Specifically, we focus on deriving the ground cost for the associated Kantorovich optimal coupling formulation. The ground cost equals the cost of transporting a unit amount of mass from a specific realization of the initial or source joint probability measure to a realization of the terminal or target joint probability measure, and it determines the Kantorovich formulation. Finding the ground cost leads to solving a structured deterministic nonlinear optimal control problem, which is shown to be amenable to an analysis technique pioneered by Athans et al. We show that such techniques have broader applicability in determining the ground cost (and thus the Kantorovich formulation) for a class of generalized optimal mass transport problems involving nonlinear dynamics with translated norm-invariant drift.
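Schematically, the Kantorovich formulation referred to in the abstract has the generic form below, where the ground cost is itself the value of a deterministic optimal control problem over the controlled Euler dynamics. The quadratic control cost and the symbols used here are placeholders rather than the paper's exact statement.

```latex
% Schematic only: generic Kantorovich coupling problem with a control-defined ground cost.
\min_{\pi \in \Pi(\mu_0,\mu_T)} \int c(\omega_0,\omega_T)\,\mathrm{d}\pi(\omega_0,\omega_T),
\qquad
c(\omega_0,\omega_T) := \inf_{u(\cdot)} \int_0^T \|u(t)\|^2\,\mathrm{d}t
\quad \text{s.t.}\quad
J\dot{\omega} = (J\omega)\times\omega + u,\;\; \omega(0)=\omega_0,\;\; \omega(T)=\omega_T.
```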


Operator Learning: A Statistical Perspective

arXiv.org Machine Learning

Operator learning has emerged as a powerful tool in scientific computing for approximating mappings between infinite-dimensional function spaces. A primary application of operator learning is the development of surrogate models for the solution operators of partial differential equations (PDEs). These methods can also be used to develop black-box simulators to model system behavior from experimental data, even without a known mathematical model. In this article, we begin by formalizing operator learning as a function-to-function regression problem and review some recent developments in the field. We also discuss PDE-specific operator learning, outlining strategies for incorporating physical and mathematical constraints into architecture design and training processes. Finally, we highlight key future directions such as active data collection and the development of rigorous uncertainty quantification frameworks.
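To illustrate the function-to-function regression viewpoint in the simplest possible terms, the sketch below learns the solution operator of the 1D Poisson problem from sampled input-output pairs using plain ridge regression on grid values; this is a stand-in for neural operator architectures, and all problem parameters are arbitrary.

```python
import numpy as np

# Illustrative sketch of operator learning as function-to-function regression:
# learn the map f -> u with -u'' = f, u(0) = u(1) = 0, from sampled pairs,
# using ridge regression on grid values (a stand-in for neural operators).
rng = np.random.default_rng(0)
m = 64                                       # interior grid size
x = np.linspace(0, 1, m + 2)[1:-1]
L = (np.diag(-2 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
     + np.diag(np.ones(m - 1), -1)) * (m + 1) ** 2     # discrete Laplacian
G = np.linalg.inv(-L)                        # discrete solution operator (Green's function)

# Training data: random smooth forcings f and their solutions u = G f.
F = np.array([np.sin(np.pi * k * x) * rng.normal() for k in rng.integers(1, 6, 200)])
U = F @ G.T

lam = 1e-6
W = np.linalg.solve(F.T @ F + lam * np.eye(m), F.T @ U)   # ridge-regression "operator"

f_test = np.sin(3 * np.pi * x)
u_pred, u_true = f_test @ W, G @ f_test
print("max error:", np.max(np.abs(u_pred - u_true)))
```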


Proper scoring rules for estimation and forecast evaluation

arXiv.org Machine Learning

In recent years, proper scoring rules have emerged as a powerful general approach for estimating probability distributions. In addition to significantly expanding the range of modeling techniques that can be applied in practice, this has also substantially broadened the conceptual understanding of estimation methods. Originally, proper scoring rules were conceived in meteorology as summary statistics for describing the performance of probabilistic forecasts (Murphy and Winkler, 1984), but they also play an important role in economics as tools for belief elicitation (Schotter and Trevino, 2014). A probabilistic forecast is a probability distribution over the space of the possible outcomes of the future event that is stated by the forecaster. The simplest and most popular case of probabilistic forecasts arises when the outcome is binary, so the probabilistic forecast reduces to issuing a predictive probability of success. Brier (1950) was the first to consider the problem of devising a scoring rule which could not be "played" by a dishonest forecasting agent. He introduced the quadratic scoring rule and showed that it incentivizes a forecasting agent to state his most accurate probability estimate when faced with uncertainty.
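As a concrete instance of the properness property described above (standard material, stated in generic notation): for the quadratic (Brier) score $S(p, y) = (p - y)^2$ with a binary outcome $y \in \{0, 1\}$, a forecaster who believes the success probability is $q$ and reports $p$ has expected score

```latex
\mathbb{E}_{y \sim \mathrm{Bernoulli}(q)}\big[(p - y)^2\big]
  = q\,(p - 1)^2 + (1 - q)\,p^2
  = (p - q)^2 + q(1 - q),
```

which is uniquely minimized at $p = q$; honest reporting is therefore optimal, which is the strict properness that Brier established for the quadratic score.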


A Learning Algorithm (Algorithm 1: Learning algorithm for Dr.k-NN; Input: training sets, m = 1, ..., M; Output: the feature mapping and the LFDs P)

Neural Information Processing Systems

B.1 Proof of Theorem 1. The proof of Theorem 1 is based on the following two lemmas. Moreover, when there is a tie (i.e., the set $\arg\max$ [...]), [...]. We here prove a more general result for an arbitrary sample space.
Lemma 2. For the uncertainty sets defined in (7), the problem $\max$ [...].
Proof of Lemma 2. Recall that the Wasserstein metric of order 1 is defined as $\mathcal{W}(P, P')$ [...].
To prove Theorem 1, it remains to verify the validity of exchanging max and min. Thereby we have shown that formulations (6) and (8) have identical optimal values. Next we verify that [...] is a feasible classifier, i.e., that it satisfies $0 \le (\cdot) \le 1$ and [...]. Thereby we have shown (16).
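For reference, the order-1 Wasserstein metric invoked in the (truncated) proof above has the standard definition below; the paper's own notation and choice of ground metric may differ.

```latex
% Standard definition of the order-1 Wasserstein distance; notation is generic.
\mathcal{W}(P, P') = \inf_{\gamma \in \Gamma(P, P')} \int d(x, x')\,\mathrm{d}\gamma(x, x'),
```

where $\Gamma(P, P')$ denotes the set of couplings with marginals $P$ and $P'$, and $d$ is the ground metric on the sample space.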